dnadna.datasets

Utilities for loading data from data sets, including support for different data set formats:

  • Classes for reading different dataset formats. Datasets are collections of SNP files for multiple scenarios, possibly with multiple replicates per scenario:

    • The NpzSNPSource class reads a data set of multiple parameter scenarios with (possibly) multiple replicates per scenario, stored in NPZ files in a particular filesystem layout, known as the DNADNA Format. This is the default data set format understood by DNADNA.

    • The DictSNPSource class reads a JSON-based data set format which is less efficient in terms of both storage compactness and parsing/serializing, but which allows plain-text storage of SNP data. Currently this is used primarily in testing.

  • The DNATrainingDataset and its simpler base class DNADataset are implementations of a PyTorch Dataset used for loading SNP data (in the form of SNPSamples along with their associated scenario parameters) for both training sets and validation sets during model training. This works independently of the dataset format: the format is implemented as an SNPSource, such as the two listed above, which is an abstract interface for arbitrary dataset formats (see the SNPSource base class documented below).

Classes

DNADataset([config, validate, source, ...])

Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.

DNATrainingDataset([config, validate, ...])

DatasetTransformationMixIn(config[, ...])

Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.

DictSNPSource(scenarios[, position_format, ...])

SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).

FileListSNPSource(filenames)

SNP source that returns scenarios from a fixed list of arbitrary files.

NpzSNPSource(root_dir, dataset_name[, ...])

SNP source that reads simulation data as SNPSamples stored on disk in DNADNA's native "dnadna" format.

SNPSource()

A "SNPSource" is a class for loading SNPSample objects from some data source.

Exceptions

MissingSNPSample(scenario, replicate, path)

Exception raised when a specified sample is not found in an SNP source.

class dnadna.datasets.DNADataset(config={}, validate=True, source=None, scenario_params=None, scenario_set=None, cached_set=None)[source]

Bases: ConfigMixIn, Dataset

Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.

This has two modes of operation: One where a scenario_params table is given as a pandas.DataFrame in the format described for the DNADNA Format. In this case, all the scenarios and replicates described in that table are returned (where they exist), and for each item in the dataset a (scenario_idx, replicate_idx, snp_sample, scenario_params) tuple is returned.

In the second mode of operation, scenario_params is not given, and the data sources are simply looped over directly. In this case a 4-tuple of (scenario_idx, replicate_idx, snp_sample, None) is returned for each item.

The DNATrainingDataset is the more complete implementation which can perform additional transformations on the data when used in model training, and which keeps separate training and validation sets.

Given a scenario_set=<scenario_idx> argument, only the data in a single scenario are returned; this may also be a list/set of scenario indices to consider.
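The two modes of operation and the scenario_set filter can be illustrated with a small stand-alone sketch. This is not DNADNA's implementation: iter_items is a hypothetical helper, a plain dict stands in for the SNP source, a dict of dicts stands in for the pandas.DataFrame of scenario parameters, and the n_replicates and mu column names are assumptions made for the illustration.

```python
def iter_items(source, scenario_params=None, scenario_set=None):
    """Yield (scenario_idx, replicate_idx, snp_sample, scenario_params) 4-tuples.

    Hypothetical stand-in for the behavior described above, not DNADNA code.
    """
    # Accept a single scenario index as well as a list/set of indices.
    if isinstance(scenario_set, int):
        scenario_set = {scenario_set}

    if scenario_params is not None:
        # Mode 1: the parameters table decides which samples are expected.
        keys = [(scenario, replicate)
                for scenario in sorted(scenario_params)
                for replicate in range(scenario_params[scenario]['n_replicates'])]
    else:
        # Mode 2: loop directly over whatever the source contains.
        keys = sorted(source)

    for scenario, replicate in keys:
        if scenario_set is not None and scenario not in scenario_set:
            continue
        if (scenario, replicate) not in source:
            continue  # only samples that actually exist are returned
        params = None if scenario_params is None else scenario_params[scenario]
        yield scenario, replicate, source[scenario, replicate], params


# Toy data: strings stand in for SNPSample objects.
source = {(0, 0): 'sample-0-0', (0, 1): 'sample-0-1', (1, 0): 'sample-1-0'}
params = {0: {'n_replicates': 2, 'mu': 1e-8},
          1: {'n_replicates': 2, 'mu': 2e-8}}

with_params = list(iter_items(source, params))
without_params = list(iter_items(source))
only_scenario_1 = list(iter_items(source, scenario_set=1))
```

In this sketch the replicate (1, 1) is listed in the table but absent from the source, so it is skipped in the first mode; in the second mode the last element of every tuple is None.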

property cached_set

Indices whose samples should be cached in memory.

config_schema = 'dataset'

The schema against which this class should validate its Config by default.

May be either the name of one of the built-in schemas (see Config.schemas) or a full schema object.

classmethod from_config_file(filename, *args, validate=True, source=None, scenario_params=None, scenario_set=None, **kwargs)[source]

Load the Config from a file.

Additional kwargs are passed to from_file.

The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.

get(index, ignore_missing_replicates=None)[source]

Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.

Parameters:

index (int) – Index of the sample to get from the dataset.

Keyword Arguments:

ignore_missing_replicates (bool) – (optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing_replicates option in the dataset configuration, but this allows overriding the config file.

class dnadna.datasets.DNATrainingDataset(config={}, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None)[source]

Bases: DatasetTransformationMixIn

config_schema = 'training'

The schema against which this class should validate its Config by default.

May be either the name of one of the built-in schemas (see Config.schemas) or a full schema object.

classmethod from_config_file(filename, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None, **kwargs)[source]

Load the Config from a file.

Additional kwargs are passed to from_file.

The additional keyword arguments are passed to the dict serializer, and the config is validated against the training schema.

class dnadna.datasets.DatasetTransformationMixIn(config, transforms=None, param_set=None, **kwargs)[source]

Bases: DNADataset

Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.

Keyword Arguments:
  • transforms (list) – (optional) – List giving transform names or transform descriptions (a transform name plus its parameters), as specified in the dataset_transforms property of the training config file. See also the training schema (schema-training). May also contain instances of Transform.

  • param_set (ParamSet) – (optional) – ParamSet object representing all the details of the parameters to learn in training, including the values of those parameters for the training and validation sets (the pre-processed scenario params). Information about the parameters can be used by some transforms.

Additional positional and keyword arguments are passed to super().__init__() so that this can be used as a mix-in with arbitrary DNADataset subclasses.

get(index, ignore_missing_replicates=None)[source]

Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.

Parameters:

index (int) – Index of the sample to get from the dataset.

Keyword Arguments:

ignore_missing_replicates (bool) – (optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing_replicates option in the dataset configuration, but this allows overriding the config file.

get_split_set(split_type)[source]

Get the set of indices for the specified split type.

Parameters:

split_type (str or list of str) – The split type(s) to retrieve.

Returns:

A frozenset of indices for the specified split type(s).

Return type:

frozenset
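As a rough illustration of how split sets and get_split_set relate, the sketch below keeps one frozenset of indices per split. SplitSets is a hypothetical class, not part of DNADNA, and taking the union when several split types are requested is an assumption; the documentation above only states that a frozenset is returned.

```python
class SplitSets:
    """Hypothetical container holding one frozenset of indices per split."""

    def __init__(self, training=(), validation=(), test=()):
        self._splits = {
            'training': frozenset(training),
            'validation': frozenset(validation),
            'test': frozenset(test),
        }

    def get_split_set(self, split_type):
        # Accept either a single split name or a list of names.
        if isinstance(split_type, str):
            split_type = [split_type]
        indices = frozenset()
        for name in split_type:
            # Assumption: several split types are combined by set union.
            indices |= self._splits[name]
        return indices

    @property
    def training_set(self):
        return self.get_split_set('training')

    @property
    def validation_set(self):
        return self.get_split_set('validation')

    @property
    def test_set(self):
        return self.get_split_set('test')


splits = SplitSets(training=[0, 1, 2, 3], validation=[4, 5], test=[6])
held_out = splits.get_split_set(['validation', 'test'])
```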

property test_set

Set of indices to use for testing.

property training_set

Set of indices to use for training.

property transforms

The composed set of transforms to apply to the dataset.

Either dnadna.transforms.Compose or a dict mapping dataset splits (“training”, “validation”, “test”) to their corresponding Compose of transforms.

property validation_set

Set of indices to use for validation.

class dnadna.datasets.DictSNPSource(scenarios, position_format=None, filename=None, lazy=True)[source]

Bases: SNPSource

SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).

Currently used just by the test suite, but may be useful in other contexts as well (e.g. serialization of simulations).

Parameters:

scenarios (dict) – dict with (simulation, replicate) tuple keys, and values in the format output by to_dict; the values may also be SNPSample instances (useful for testing).

Keyword Arguments:
  • position_format (dict) – (optional) – Position format dict corresponding to the pos_format argument to SNPSample (currently all samples in the dataset are assumed to have the same position formats).

  • filename (str) – (optional) – If the scenarios dict was read from a file (e.g. a JSON or YAML file) this can be set to the filename; this is used just as a convenience when reporting errors.

  • lazy (bool) – (optional) – By default data is lazy-loaded, so that it is not converted from the dict format until needed. Use lazy=False to ensure that the data is immediately converted.
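The plain-text layout of a single sample can be sketched without DNADNA itself. The record shape, with one string of 0/1 characters per row of the SNP matrix under 'SNP' and a list of floats under 'POS', follows the to_dict output shown in the example below; snp_to_strings and strings_to_snp are hypothetical helpers written for this illustration, not DNADNA functions.

```python
import json


def snp_to_strings(matrix):
    """Encode each row of a 0/1 matrix as a string such as '01'."""
    return [''.join(str(int(value)) for value in row) for row in matrix]


def strings_to_snp(rows):
    """Decode strings of 0/1 characters back into lists of ints."""
    return [[int(char) for char in row] for row in rows]


matrix = [[0, 1], [1, 0]]
record = {'SNP': snp_to_strings(matrix), 'POS': [0.1, 0.2]}

# The record contains only strings, lists and floats, so it survives a
# round trip through JSON unchanged.
restored = json.loads(json.dumps(record))
decoded = strings_to_snp(restored['SNP'])
```

Storing each matrix row as text is what makes this format readable but less compact than the binary NPZ format.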

Examples

>>> from dnadna.datasets import DictSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> sample = SNPSample([[0, 1], [1, 0]], [0.1, 0.2])
>>> source = DictSNPSource({(0, 0): sample.to_dict()},
...                        filename='scenario_0_0.json')
>>> source.scenarios
{(0, 0): {'SNP': ['01', '10'], 'POS': [0.1, 0.2]}}
>>> (0, 0) in source
True
>>> source[0, 0]
SNPSample(
    snp=tensor([[0, 1],
                [1, 0]], dtype=torch.uint8),
    pos=tensor([0.1000, 0.2000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='scenario_0_0.json'
)

If the requested sample doesn’t exist in the dataset a MissingSNPSample exception is raised:

>>> (0, 1) in source
False
>>> source[0, 1]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 0 replicate 1
from "scenario_0_0.json": KeyError((0, 1))

name = 'dict'

The user-facing name of the plugin, which can be provided by a user implementing a plugin.

Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.

plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source.dict'

Base URL for all DNADNA plugins.

New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.

class dnadna.datasets.FileListSNPSource(filenames)[source]

Bases: object

SNP source that returns scenarios from a fixed list of arbitrary files.

Because the concepts of “scenarios” and “replicates” are not necessarily applicable to an arbitrary list of files, each file is considered a single scenario of one replicate (e.g. source[3, 0] returns the contents of the fourth file in the list).
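The index convention can be pictured with a toy class. FileListSketch is hypothetical and only maps indices to paths; the real class loads the file contents, and raising KeyError for an unknown index is a simplification chosen for this sketch.

```python
class FileListSketch:
    """Toy illustration: file number i is scenario i with a single replicate 0."""

    def __init__(self, filenames):
        self.filenames = list(filenames)

    def __contains__(self, key):
        scenario, replicate = key
        return replicate == 0 and 0 <= scenario < len(self.filenames)

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        scenario, _ = key
        # The real source would load and return the sample stored in this
        # file; the sketch returns the path itself.
        return self.filenames[scenario]


files = FileListSketch(['a.npz', 'b.npz', 'c.npz', 'd.npz'])
fourth = files[3, 0]
```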

exception dnadna.datasets.MissingSNPSample(scenario, replicate, path, reason=None)[source]

Bases: Exception

Exception raised when a specified sample is not found in an SNP source.

class dnadna.datasets.NpzSNPSource(root_dir, dataset_name, filename_format=None, keys=('SNP', 'POS'), position_format=None, lazy=True)[source]

Bases: SNPSource

SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.

Each simulation is stored in a NumPy NPZ file containing two arrays, by default keyed by 'SNP' for the SNP matrix, and 'POS' for the positions array.

There is one .npz file for each replicate of each scenario, laid out in a filesystem format. The exact layout and filename can be specified by the filename_format argument to this class’s constructor, but the default layout is as specified in NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, which is also the documented format assumed by the “dnadna” format.

Parameters:
  • root_dir (str, pathlib.Path) – The root directory of the DNADNA dataset. All filenames generated from the filename_format are appended to this directory.

  • dataset_name (str) – The name of the dataset, the same as that specified in the simulation config for this dataset.

Keyword Arguments:
  • filename_format (str) – (optional) – A string in Python format string syntax specifying the format for filenames of individual simulations in this dataset. The format string can contain 3 replacement fields: {dataset_name}, which is filled in with the model name given by the dataset_name parameter above; {scenario}, which is filled in with the scenario index; and {replicate}, which is filled in with the replicate index. If the scenario and replicate indices are zero-padded in the filenames, the amount of zero-padding may be specified explicitly by writing the format string like {scenario:05} (for scenario indices zero-padded to a width of 5 digits). However, if no zero-padding is specified in the format string, the appropriate amount of zero-padding is guessed automatically from the filenames actually present in the dataset. Therefore the default filename_format, NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, can be used regardless of the amount of zero-padding used in a given dataset.

  • keys (tuple) – (optional) – A 2-tuple of (snp_key, pos_key) giving the keywords for the SNP matrix and the position array in the NPZ file. The default ('SNP', 'POS') is the default for the “dnadna” format, but different names may be specified for these arrays.

  • position_format (dict) – (optional) – The format of the position arrays in the dataset (currently all samples in the dataset are assumed to have the same position formats). Corresponds to the pos_format argument to SNPSample.

  • lazy (bool) – (optional) – By default data is lazy-loaded, so that it is not read from disk until needed. Use lazy=False to ensure that the data is immediately loaded into memory.
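The filename_format behavior relies on ordinary Python format strings, so it can be tried out without a dataset. In the sketch below the default format string is copied from DEFAULT_NPZ_FILENAME_FORMAT, while guess_padding is a hypothetical helper that only illustrates the idea of inferring zero-padding from existing filenames; it is not how DNADNA necessarily does it.

```python
import re

DEFAULT_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'

# Without an explicit width, indices are rendered without zero-padding.
plain = DEFAULT_FORMAT.format(dataset_name='my_model', scenario=3, replicate=7)

# With an explicit width such as {scenario:05}, indices are zero-padded.
padded_format = ('scenario_{scenario:05}/'
                 '{dataset_name}_{scenario:05}_{replicate:03}.npz')
padded = padded_format.format(dataset_name='my_model', scenario=3, replicate=7)


def guess_padding(filenames, dataset_name):
    """Hypothetical helper: return (scenario_width, replicate_width).

    Looks at the first filename ending in <dataset_name>_<digits>_<digits>.npz
    and reports how many digits each index uses; returns None if none match.
    """
    pattern = re.compile(re.escape(dataset_name) + r'_(\d+)_(\d+)\.npz$')
    for name in filenames:
        match = pattern.search(name)
        if match:
            return len(match.group(1)), len(match.group(2))
    return None


widths = guess_padding(['scenario_003/my_model_003_12.npz'], 'my_model')
```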

Examples

>>> import numpy as np
>>> from dnadna.datasets import NpzSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Make a few random SNP and position arrays:

>>> dataset = {}
>>> filename_format = 'my_model_{scenario:03}_{replicate:03}.npz'
>>> for scenario_idx, replicate_idx in zip(range(2), range(2)):
...     snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
...     pos = np.sort(np.random.random(10))
...     sample = SNPSample(snp, pos)
...     filename = tmp / filename_format.format(
...         scenario=scenario_idx, replicate=replicate_idx)
...     sample.to_npz(filename)
...     dataset[(scenario_idx, replicate_idx)] = sample

Instantiate the NpzSNPSource and load a couple of samples:

>>> source = NpzSNPSource(tmp, 'my_model', filename_format=filename_format)
>>> source[0, 0]
SNPSample(
    snp=tensor([[...],
                ...
                [...]], dtype=torch.uint8),
    pos=tensor([...], dtype=torch.float64),
    pos_format={'normalized': True},
    path=...Path('...my_model_000_000.npz')
)
>>> source[0, 0] == dataset[0, 0]
True
>>> source[1, 1] == dataset[1, 1]
True
>>> source[2, 0]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 2 replicate 0
from "...my_model_002_000.npz": FileNotFoundError(2, 'No file matching or
similar to')

DEFAULT_NPZ_FILENAME_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'

Default format string for filenames relative to the root_dir of an NpzSNPSource.

This is the default filesystem layout for the DNADNA format. Each scenario has its own directory named scenario_<scenario_idx> where the scenario_idx is typically zero-padded the correct amount for the total number of scenarios in the dataset.

Each simulation file in a scenario has the filename <dataset_name>_<scenario_idx>_<replicate_idx>.npz, where both scenario_idx and replicate_idx are again zero-padded an appropriate amount.

In a simulation config with the option {"data_source": {"format": "dnadna"}}, this default filename format can be overridden with the {"data_source": {"filename_format": "..."}} option.
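The override described above amounts to a lookup with a fallback. The sketch below uses a plain dict in place of a DNADNA Config object; resolve_filename_format is a hypothetical helper and the custom format string is made up for the example.

```python
DEFAULT_NPZ_FILENAME_FORMAT = (
    'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz')


def resolve_filename_format(config):
    """Return the configured filename format, or the default if none is set."""
    data_source = config.get('data_source', {})
    return data_source.get('filename_format', DEFAULT_NPZ_FILENAME_FORMAT)


# A config that only selects the format uses the default layout.
default_config = {'data_source': {'format': 'dnadna'}}

# A config that also sets filename_format overrides the layout.
custom_config = {
    'data_source': {
        'format': 'dnadna',
        'filename_format': '{dataset_name}_{scenario}_{replicate}.npz',
    }
}

default_format = resolve_filename_format(default_config)
custom_format = resolve_filename_format(custom_config)
```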

classmethod from_config(config, validate=True)[source]

Instantiate an NpzSNPSource from a simulation Config matching the simulation schema.

name = 'dnadna'

The user-facing name of the plugin, which can be provided by a user implementing a plugin.

Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.

plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source.dnadna'

Base URL for all DNADNA plugins.

New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.

class dnadna.datasets.SNPSource[source]

Bases: Pluggable

A “SNPSource” is a class for loading SNPSample objects from some data source.

Subclasses of this class represent different data formats from which samples can be loaded.

This is in a way “lower-level” than DNADataset. DNADataset is an abstraction that loads SNPSamples from a data source, possibly performs some transforms on them, and returns them. From the point of view of DNADataset the actual on-disk format from which the samples are read is abstracted out to SNPSource.

In fact it may not even be an “on-disk” format; for example one could implement a SNPSource plugin that loads samples from an S3 bucket.

The “main” implementation of SNPSource is NpzSNPSource which loads samples organized on disk in the “dnadna” format. The other built-in implementations include:

  • FileListSNPSource – a simple format that reads SNPSamples from a list of filenames; this is used primarily by the dnadna predict command for reading in a list of files on which to make predictions.

  • DictSNPSource – used primarily for testing, it can read samples from a JSON-compatible dict format; see its documentation for more details.
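The examples for DictSNPSource and NpzSNPSource both show a source being queried with (scenario, replicate) in source and source[scenario, replicate], and raising MissingSNPSample when a sample cannot be loaded. The stand-alone sketch below mimics that access pattern for an in-memory source. It does not import DNADNA: InMemorySource and MissingSample are hypothetical stand-ins, and a real plugin would subclass SNPSource and implement from_config.

```python
class MissingSample(Exception):
    """Local stand-in for dnadna.datasets.MissingSNPSample."""

    def __init__(self, scenario, replicate, path, reason=None):
        self.scenario = scenario
        self.replicate = replicate
        self.path = path
        self.reason = reason
        super().__init__(
            f'could not load scenario {scenario} replicate {replicate} '
            f'from "{path}": {reason!r}')


class InMemorySource:
    """Hypothetical source keeping samples in a dict keyed by (scenario, replicate)."""

    def __init__(self, samples, label='<memory>'):
        self._samples = dict(samples)
        self._label = label  # used only when reporting errors

    def __contains__(self, key):
        return tuple(key) in self._samples

    def __getitem__(self, key):
        scenario, replicate = key
        try:
            return self._samples[scenario, replicate]
        except KeyError as exc:
            # Translate the low-level error into the source-level exception.
            raise MissingSample(scenario, replicate, self._label,
                                reason=exc) from exc


source = InMemorySource({(0, 0): 'sample-0-0'})
found = source[0, 0]

try:
    source[0, 1]
except MissingSample as exc:
    error_message = str(exc)
```

Wrapping the underlying error keeps callers independent of where the samples actually live, which is the separation between DNADataset and SNPSource described above.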

classmethod from_config(config, validate=True)[source]

Instantiate an SNPSource from a dataset Config matching the dataset schema.

Although configuration specific to a given SNPSource subclass may have its own format-specific schema, these are still passed the full dataset config, which may contain additional properties (such as data_root) that might be useful to a given format.

Subclasses should implement this method to specify how they are instantiated from a config file; otherwise they cannot be used as configurable plugins.

classmethod from_config_file(filename, validate=True, **kwargs)[source]

Like from_config but given a filename instead of a Config object.

The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.

name = 'snp_source'

The user-facing name of the plugin, which can be provided by a user implementing a plugin.

Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.

pluggable

alias of SNPSource

plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source'

Base URL for all DNADNA plugins.

New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.